Exercise: Fitting a Polynomial Curve¶

In this exercise, we'll take a look at a different type of regression called polynomial regression. In contrast to linear regression, which models relationships as straight lines, polynomial regression models relationships as curves.

Recall in our previous exercise how the relationship between core_temperature and protein_content_of_last_meal couldn't be properly explained using a straight line. In this exercise, we'll use polynomial regression to fit a curve to the data instead.

Data visualization¶

Let's start this exercise by loading and having a look at our data.

In [1]:
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-illness.csv

#Import the data from the .csv file
dataset = pandas.read_csv('doggy-illness.csv', delimiter="\t")

#Let's have a look at the data
dataset

Out[1]:
    male  attended_training  age  body_fat_percentage  core_temperature  ate_at_tonys_steakhouse  needed_intensive_care  protein_content_of_last_meal
0      0                  1  6.9                   38         38.423169                        0                      0                          7.66
1      0                  1  5.4                   32         39.015998                        0                      0                         13.36
2      1                  1  5.4                   12         39.148341                        0                      0                         12.90
3      1                  0  4.8                   23         39.060049                        0                      0                         13.45
4      1                  0  4.8                   15         38.655439                        0                      0                         10.53
..   ...                ...  ...                  ...               ...                      ...                    ...                           ...
93     0                  0  4.5                   38         37.939942                        0                      0                          7.35
94     1                  0  1.8                   11         38.790426                        1                      1                         12.18
95     0                  0  6.6                   20         39.489962                        0                      0                         15.84
96     0                  0  6.9                   32         38.575742                        1                      1                          9.79
97     1                  1  6.0                   21         39.766447                        1                      1                         21.30

98 rows × 8 columns

Simple Linear Regression¶

Let's quickly jog our memory by performing the same simple linear regression as we did in the previous exercise, using the core_temperature and protein_content_of_last_meal columns of the dataset.

In [2]:
import statsmodels.formula.api as smf
import graphing # custom graphing code. See our GitHub repo for details

# Perform linear regression. This method takes care of
# the entire fitting procedure for us.
simple_formula = "core_temperature ~ protein_content_of_last_meal"
simple_model = smf.ols(formula = simple_formula, data = dataset).fit()

# Show a graph of the result
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             trendline=lambda x: simple_model.params[1] * x + simple_model.params[0])

Notice that the relationship between the two variables isn't truly linear. Looking at the plot, it's fairly clear that the points tend to sit to one side of the line, especially at higher core_temperature and protein_content_of_last_meal values.

A straight line might not be the best way to describe this relationship.

Let's have a quick look at the model's R-Squared score:

In [3]:
print("R-squared:", simple_model.rsquared)
R-squared: 0.9155158150005706

That is quite a reasonable R-Squared score, but let's see if we can get an even better one!
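As an aside, the R-Squared score that statsmodels reports can be reproduced by hand: it's one minus the ratio of the residual sum of squares to the total sum of squares. Here's a minimal, self-contained sketch using synthetic data (not the doggy-illness dataset) and NumPy's polyfit in place of statsmodels:

```python
import numpy as np

# Illustrative data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

A score near 1 means the line explains almost all of the variation in the label; a score near 0 means it explains almost none.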

Simple Polynomial Regression¶

Let's fit a simple polynomial regression this time. Similar to a simple linear regression, a simple polynomial regression models the relationship between a label and a single feature. Unlike a simple linear regression, a simple polynomial regression can explain relationships that aren't simply straight lines.

In our example, we're going to use a second-degree polynomial, which has three parameters: an intercept, a coefficient for $X$, and a coefficient for $X^2$.

In [4]:
# Perform polynomial regression. This method takes care of
# the entire fitting procedure for us.
polynomial_formula = "core_temperature ~ protein_content_of_last_meal + I(protein_content_of_last_meal**2)"
polynomial_model = smf.ols(formula = polynomial_formula, data = dataset).fit()

# Show a graph of the result
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # Our trendline is the equation for the polynomial
                             trendline=lambda x: polynomial_model.params[2] * x**2 + polynomial_model.params[1] * x + polynomial_model.params[0])

That looks a lot better already. Let's confirm by having a quick look at the R-Squared score:

In [5]:
print("R-squared:", polynomial_model.rsquared)
R-squared: 0.9514426069911689

That's a better R-Squared score than the one obtained from the previous model! We can now confidently tell our vet to prioritize dogs who ate a high-protein diet the night before.

Let's chart our model as a 3D chart. We'll view $X$ and $X^2$ as two separate parameters. Notice that if you rotate the visual just right, our regression model is still a flat plane. This is why polynomial models are still considered to be linear models.

In [6]:
import numpy as np
fig = graphing.surface(
    x_values=np.array([min(dataset.protein_content_of_last_meal), max(dataset.protein_content_of_last_meal)]),
    y_values=np.array([min(dataset.protein_content_of_last_meal)**2, max(dataset.protein_content_of_last_meal)**2]),
    calc_z=lambda x,y: polynomial_model.params[0] + (polynomial_model.params[1] * x) + (polynomial_model.params[2] * y),
    axis_title_x="x",
    axis_title_y="x2",
    axis_title_z="Core temperature"
)
# Add our datapoints to it and display
fig.add_scatter3d(x=dataset.protein_content_of_last_meal, y=dataset.protein_content_of_last_meal**2, z=dataset.core_temperature, mode='markers')
fig.show()
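The "flat plane" observation can also be checked numerically: treating $X$ and $X^2$ as two separate input columns and running an ordinary linear least-squares fit recovers exactly the same three parameters as a polynomial fit. A small self-contained sketch (synthetic data, NumPy instead of statsmodels):

```python
import numpy as np

# Illustrative quadratic data: y = 2 + 0.5x + 0.1x^2
x = np.linspace(0, 10, 20)
y = 2 + 0.5 * x + 0.1 * x ** 2

# Route 1: ordinary *linear* least squares on two features, x and x^2
design = np.column_stack([np.ones_like(x), x, x ** 2])
coef_linear, *_ = np.linalg.lstsq(design, y, rcond=None)

# Route 2: a degree-2 polynomial fit (coefficients highest power first)
coef_poly = np.polyfit(x, y, 2)

# Both routes recover the same three parameters
print(np.allclose(coef_linear, coef_poly[::-1]))  # → True
```

The model is linear *in its parameters*; only the features are nonlinear transformations of the original input. That's why polynomial regression still counts as a linear model.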

Extrapolating¶

Let's see what happens if we extrapolate our data. We'd like to see if dogs that ate meals even higher in protein are expected to get even sicker.

Let's start with the linear regression. We can set what range we'd like to extrapolate our data over by using the x_range argument in the plotting function. Let's extrapolate over the range [0,100]:

In [7]:
# Show an extrapolated graph of the linear model
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # We extrapolate over the following range
                             x_range = [0,100],
                             trendline=lambda x: simple_model.params[1] * x + simple_model.params[0])

Next, we extrapolate the polynomial regression over the same range:

In [8]:
# Show an extrapolated graph of the polynomial model
graphing.scatter_2D(dataset, label_x="protein_content_of_last_meal", 
                             label_y="core_temperature",
                             # We extrapolate over the following range
                             x_range = [0,100],
                             trendline=lambda x: polynomial_model.params[2] * x**2 + polynomial_model.params[1] * x + polynomial_model.params[0])

These two graphs predict two very different things!

The extrapolated polynomial regression expects core_temperature to go down, while the extrapolated linear regression expects it to go up.

A quick look at the graphs obtained in the previous exercise confirms that we should expect core_temperature to rise, not fall, as protein_content_of_last_meal increases.

In general, it's not recommended to extrapolate from a polynomial regression unless you have an a priori reason to do so. That's only very rarely the case, so it's best to err on the side of caution and never extrapolate from polynomial regressions.
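To see numerically why extrapolating a polynomial is risky, here's a self-contained sketch with synthetic concave data (a stand-in for the core_temperature vs. protein data, fit with NumPy rather than statsmodels): both fits agree inside the observed range, but diverge wildly outside it.

```python
import numpy as np

# Illustrative concave data observed only on [0, 20]
x = np.linspace(0, 20, 30)
y = 38 + 0.2 * x - 0.005 * x ** 2

linear = np.polyfit(x, y, 1)     # straight-line fit
quadratic = np.polyfit(x, y, 2)  # curve fit

# Inside the observed range the two fits agree closely...
print(np.polyval(linear, 10), np.polyval(quadratic, 10))

# ...but extrapolated to x = 100 they diverge: the quadratic's
# negative x^2 term eventually drags its predictions downward,
# while the straight line keeps climbing.
print(np.polyval(linear, 100), np.polyval(quadratic, 100))
```

The curve fits the observed data better, yet its behavior far outside the data range is driven entirely by the $X^2$ term, which the data says nothing about.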

Summary¶

In this exercise, we:

  • Built simple linear regression and simple polynomial regression models.
  • Compared the performance of both models by plotting them and looking at R-Squared values.
  • Extrapolated the models over a wider range of values.